Data Manipulation with Pandas


In [ ]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)

Categorical Types

  • Pandas provides a convenient dtype for representing categorical, or factor, data

In [ ]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'a'])

In [ ]:
c

In [ ]:
c.describe()

In [ ]:
c.codes

In [ ]:
c.categories
  • By default the Categorical type represents an unordered categorical
  • You can provide information about the order of categories

In [ ]:
c.as_ordered()
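As a minimal sketch of providing the order up front (using hypothetical size data, not the `c` variable above), you can pass `categories=` and `ordered=True` directly to the constructor; ordered categoricals then support comparisons and `min`/`max`:

```python
import pandas as pd

# Hypothetical ordered Categorical: categories= fixes the set of levels,
# ordered=True gives them a meaningful order.
sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

print(sizes.min())      # smallest category present
print(sizes > 'small')  # elementwise comparison against a category
```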

Support in DataFrames

  • When a Categorical is in a DataFrame, there is a special cat accessor
  • This gives access to all of the features of the Categorical type

In [ ]:
dta = pd.DataFrame.from_dict({'factor': c,
                              'x': np.random.randn(10)})

In [ ]:
dta.head()

In [ ]:
dta.dtypes

In [ ]:
dta.factor.cat

In [ ]:
dta.factor.cat.categories

In [ ]:
dta.factor.describe()

Exercise

  • Load NFS data again. Convert fditemno to a Categorical Type. Use describe.

In [ ]:
# [Solution Here]

In [ ]:
%load solutions/load_nfs_categorical.py

Date and Time Types

Pandas provides conveniences for working with dates

Creating a Range of Dates


In [ ]:
dates = pd.date_range("1/1/2015", periods=75, freq="D")
dates

In [ ]:
y = pd.Series(np.random.randn(75), index=dates)
y.head()

In [ ]:
y.reset_index().dtypes

Support in DataFrames

  • When a datetime type is in a DataFrame, there is a special dt accessor
  • This gives access to all of the features of the datetime type

In [ ]:
dta = (y.reset_index(name='y').
       rename(columns={'index': 't'}))

In [ ]:
dta.head()

In [ ]:
dta.dtypes

In [ ]:
dta.t.dt.freq

In [ ]:
dta.t.dt.day

Indexing with Dates

  • You can use strings
  • Note: the ending index is inclusive here. This is different than most of the rest of Python

In [ ]:
y.loc["2015-01-01":"2015-01-15"]

DatetimeIndex supports partial string indexing


In [ ]:
y["2015-01"]
  • You can resample to a lower frequency, specifying how to aggregate
  • Uses the DatetimeIndexResampler object

In [ ]:
resample = y.resample("M")

In [ ]:
resample.mean()

Or go to a higher frequency, optionally specifying how to fill in the missing values


In [ ]:
y.asfreq('H', method='ffill')

There are convenience methods to lag and lead time series


In [ ]:
y

In [ ]:
y.shift(1)

In [ ]:
y.shift(-1)

Rolling and Window Functions

  • Pandas also provides a number of convenience functions for working on rolling or moving windows of time series through a common interface
  • This interface is the new Rolling object

In [ ]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', 
                                                          periods=1000))
ts = ts.cumsum()

In [ ]:
rolling = ts.rolling(window=60)
rolling

In [ ]:
rolling.mean()

Exercise

  • Create a datetime column named 'date' for the NFS_1974.csv diary data
    • styr: Survey year
    • stmth: Survey month
    • logday: Day in the log (assume logdays are actual days)
  • Hint: You could do this in two ways
    • Look at the parse_dates keyword of read_csv
    • Create the date after reading in the DataFrame

In [ ]:
# [Solution here]

In [ ]:
%load solutions/load_nfs_datetime.py

Merging and Joining DataFrames


In [ ]:
# this is a bit slow because of the date parsing

transit = pd.read_csv("../data/AIS/transit_segments.csv", 
                      parse_dates=['st_time', 'end_time'],
                      infer_datetime_format=True)
vessels = pd.read_csv("../data/AIS/vessel_information.csv")
  • A lot of the time data that comes from relational databases will be normalized
  • I.e., redundant information will be put in separate tables
  • Users are expected to merge or join tables to work with them

In [ ]:
vessels.head()

In [ ]:
transit.head()
  • Several ships in the vessels data have traveled multiple segments, as we would expect
  • Matching the names in the transit data to the vessels data is thus a many-to-one match
  • Aside: pandas Indices (of which Columns are one) are set-like

In [ ]:
vessels.columns.intersection(transit.columns)

Merging

  • We can combine these two datasets for a many-to-one match
  • merge will use the common columns if we do not explicitly specify the columns

In [ ]:
transit.merge(vessels).head()

Watch out, when merging on columns, indices are discarded


In [ ]:
A = pd.DataFrame(np.random.randn(25, 2), 
                 index=pd.date_range('1/1/2015', periods=25))
A[2] = np.repeat(list('abcde'), 5)
A

In [ ]:
B = pd.DataFrame(np.random.randn(5, 2))
B[2] = list('abcde')
B

In [ ]:
A.merge(B, on=2)

Joins

  • Join is like merge, but it works on the indices
  • The same could be achieved with merge and the left_index and right_index keywords

In [ ]:
transit.set_index('mmsi', inplace=True)
vessels.set_index('mmsi', inplace=True)

In [ ]:
transit.join(vessels).head()
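The equivalence between join and merge-on-index noted above can be sketched with small hypothetical frames (the `left`/`right` names are illustrative, not from the AIS data):

```python
import pandas as pd

# Hypothetical frames sharing index labels
left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20]}, index=['x', 'z'])

joined = left.join(right)  # left join on the indices by default
merged = left.merge(right, left_index=True, right_index=True,
                    how='left')  # the same result via merge

print(joined.equals(merged))  # True
```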

Exercise

  • Join the 1974 Household NFS data with the Diary data
  • The data is in ../data/NationalFoodSurvey/NFS_1974/

In [ ]:
%load solutions/join_nfs.py

Concatenation

  • Another common operation is appending data row-wise or column-wise to an existing dataset
  • We can use the concat function for this
  • Let's import two Ebola situation reports from Guinea, each a daily snapshot of case counts
  • We will use the Date and Description columns of each dataset as a hierarchical index

In [ ]:
df1 = pd.read_csv('../data/ebola/guinea_data/2014-08-04.csv', 
                  index_col=['Date', 'Description'])
df2 = pd.read_csv('../data/ebola/guinea_data/2014-08-26.csv',
                 index_col=['Date', 'Description'])

In [ ]:
df1.shape

In [ ]:
df2.shape

In [ ]:
df1.head()

In [ ]:
df2.head()

In [ ]:
df1.index.is_unique

In [ ]:
df2.index.is_unique

We can concatenate on the rows


In [ ]:
df = pd.concat((df1, df2), axis=0)
df.shape
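Column-wise concatenation works the same way with `axis=1`, aligning on the index; a minimal sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames sharing row labels
left = pd.DataFrame({'a': [1, 2]}, index=['r1', 'r2'])
right = pd.DataFrame({'b': [3, 4]}, index=['r1', 'r2'])

# axis=1 aligns on the index and stacks the columns side by side
wide = pd.concat((left, right), axis=1)
print(wide.shape)  # (2, 2)
```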

Exercise

  • Join all of the diary data together in a single DataFrame
    • Hint: you might find glob.glob useful
  • You will need to add a unique field identifying the survey year to each DataFrame
    • Hint: you might find a regular expression using re.search useful

In [ ]:
# [Solution here]

In [ ]:
%load solutions/concat_nfs.py

Text Data Manipulation

  • Much like the cat and dt accessors we've already seen
  • String types have a str accessor that provides fast string operations on columns

In [ ]:
vessels.type
  • Count the '/' separators in the vessel type strings

In [ ]:
vessels.type.str.count('/').max()
  • Split on the separator with expand=True to return a DataFrame with NaN padding

In [ ]:
vessels.type.str.split('/', expand=True)
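Beyond count and split, the str accessor vectorizes most common string operations; a small sketch on a hypothetical Series (not the vessels data):

```python
import pandas as pd

# Hypothetical vessel-type strings
s = pd.Series(['Towing/Tug', 'Cargo', 'Fishing/Trawler'])

print(s.str.lower())                           # lowercase every element
print(s.str.len())                             # string length per element
print(s.str.replace('/', ' & ', regex=False))  # literal substitution
```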

Exercise

  • Load the file "Ref_ food groups.txt"
  • Get all of the food groups that contain the word milk

In [ ]:
# [Solution here]

In [ ]:
%load solutions/nfs_dairy.py